REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs
Authors
Abstract
We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of Õ(HS√(AT)). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
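As a rough illustration of the central quantity in this bound (not the paper's implementation), the sketch below computes, for a small hand-made average-reward MDP, an approximation of the optimal bias vector h* via relative value iteration and reports its span H = max(h*) - min(h*); this is the quantity REGAL's regularizer penalizes and that enters the Õ(HS√(AT)) regret bound. The toy transition matrix, rewards, and the helper name relative_value_iteration are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch only: approximate the optimal bias vector h* of a known
# toy average-reward MDP with relative value iteration, then report its span,
# the quantity H appearing in the O~(H S sqrt(A T)) regret bound.

def relative_value_iteration(P, r, n_iters=10_000, ref_state=0, tol=1e-10):
    """P: (S, A, S) transition tensor, r: (S, A) rewards.
    Returns an approximation of the optimal gain and bias vector."""
    S, A = r.shape
    h = np.zeros(S)
    for _ in range(n_iters):
        q = r + P @ h            # (S, A): one-step lookahead values
        v = q.max(axis=1)        # greedy Bellman backup
        gain = v[ref_state]      # subtract the value at a reference state
        h_new = v - gain         # keeps the iterates bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return gain, h

# A 3-state, 2-action toy MDP (made up for illustration; rows sum to 1).
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]],   # transitions from state 0
    [[0.0, 0.8, 0.2], [0.3, 0.6, 0.1]],   # transitions from state 1
    [[0.5, 0.0, 0.5], [0.0, 0.2, 0.8]],   # transitions from state 2
])
r = np.array([
    [0.0, 0.1],
    [0.2, 0.5],
    [1.0, 0.3],
])

gain, h = relative_value_iteration(P, r)
H = h.max() - h.min()            # span of the approximate optimal bias vector
S, A = r.shape
T = 100_000                      # illustrative horizon
print(f"optimal gain ~ {gain:.3f}, span(h*) ~ {H:.3f}")
# Order of the leading term in the bound, ignoring constants and log factors.
print(f"H * S * sqrt(A * T) ~ {H * S * np.sqrt(A * T):.1f}")
```

In REGAL itself the MDP is unknown: at the start of each episode the algorithm selects, among models consistent with the data so far, one that trades off a high optimal gain against a penalty proportional to the span of its bias vector, and then follows that model's optimal policy for the episode.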
Publication date: 2009